
    TwistBytes - identification of Cuneiform languages and German dialects at VarDial 2019

    We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro-averaged F1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro-averaged F1 of 74.7%.
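    The macro-averaged F1 reported for both tasks is the unweighted mean of the per-class F1 scores, so every dialect or language counts equally regardless of how frequent it is. A minimal sketch of the metric with scikit-learn (the labels below are hypothetical examples, not GDI or CLI data):

```python
# Minimal sketch of the macro-averaged F1 metric used in the VarDial shared tasks.
# The labels below are hypothetical examples, not the actual GDI/CLI data.
from sklearn.metrics import f1_score

gold = ["BE", "BS", "LU", "ZH", "BE", "ZH"]   # gold dialect labels
pred = ["BE", "ZH", "LU", "ZH", "BS", "ZH"]   # system predictions

# Macro averaging computes F1 per class and takes the unweighted mean,
# so rare classes count as much as frequent ones.
print(f1_score(gold, pred, average="macro"))
```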

    TRANSLIT : a large-scale name transliteration resource

    Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approx. 1.6 million entries in more than 180 languages with about 3 million variations of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92% on the identification of transliterated pairs.
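    As a point of reference for what such detection involves, the following is a naive character-similarity sketch for deciding whether two names might form a transliteration pair. It is an illustration only, not one of the TRANSLIT baseline systems, and it assumes the unidecode package for romanization:

```python
# Naive similarity baseline for deciding whether two names form a
# transliteration pair; illustrative only, not a TRANSLIT baseline system.
from difflib import SequenceMatcher
from unidecode import unidecode  # romanizes e.g. Cyrillic to Latin characters

def is_transliteration_pair(source: str, target: str, threshold: float = 0.7) -> bool:
    """Romanize both names and compare them with a character-level ratio."""
    a, b = unidecode(source).lower(), unidecode(target).lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(is_transliteration_pair("Чайковский", "Tchaikovsky"))  # True: similar after romanization
print(is_transliteration_pair("Москва", "Berlin"))           # False: unrelated strings
```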

    Methods of NLP in arts management

    The growth of digital archives and libraries in art, literature, and music; the shift of cultural marketing and cultural criticism to social media and online platforms; and the emergence of new digital art and cultural products lead to an enormous increase in digital data, creating challenges as well as opportunities for arts management practitioners and researchers. For arts practitioners, NLP can be used to improve marketing and communication through target group analysis, event evaluation, (social) media analysis, pricing, social media optimization, advertisement targeting, or search engine optimization. In the field of archives, collections, and libraries, NLP can contribute to improving the indexing, consistency, and quality of databases as well as to the development of suitable search algorithms. In the distribution of cultural products, online platforms can be improved and markets analyzed.

    ZHAW-InIT : social media geolocation at VarDial 2020

    We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict the geographical location (latitude and longitude) of an input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km and had the best result of the 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of 10 submissions (fourth of 8 unconstrained systems), and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).
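    The median-distance figures above are the median great-circle distance between predicted and gold coordinates. A minimal sketch of this evaluation metric (the coordinates below are made-up examples, not task data):

```python
# Sketch of the SMG evaluation metric: the median great-circle distance (km)
# between predicted and gold coordinates. Coordinates below are made-up examples.
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

gold = [(47.37, 8.54), (46.95, 7.45)]   # e.g. Zurich, Bern
pred = [(47.05, 8.30), (46.80, 7.15)]   # hypothetical system outputs
print(median(haversine_km(g[0], g[1], p[0], p[1]) for g, p in zip(gold, pred)))
```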

    Twist Bytes : German dialect identification with data mining optimization

    We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a sentence belonged to. We adopted two different meta-classifier approaches and used data mining insights to improve the preprocessing and the meta-classifier parameters. In particular, we focused on different feature extraction methods and on how to combine them, since they influenced the system's performance very differently. Our system achieved second place out of 8 teams, with a macro-averaged F1 of 64.6%. We also participated in the surprise dialect task with a multi-label approach.
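    As an illustration of combining several feature extraction methods in a single classifier, the following scikit-learn sketch unions character and word n-gram features before a linear SVM; it does not reproduce the actual Twist Bytes features or meta-classifier settings:

```python
# Illustrative sketch of combining feature extractors for dialect identification;
# not the actual Twist Bytes configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("clf", LinearSVC()),
])

train_texts = ["grüezi mitenand", "moin moin"]   # toy sentences, not GDI data
train_labels = ["ZH", "BS"]
pipeline.fit(train_texts, train_labels)
print(pipeline.predict(["grüezi zäme"]))
```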

    CEASR : a corpus for evaluating automatic speech recognition

    In this paper, we present CEASR, a Corpus for Evaluating ASR quality. It is a data set derived from public speech corpora, containing manual transcripts enriched with metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We then use CEASR to evaluate the quality of ASR systems on the basis of their Word Error Rate (WER). Our experiments show, among other results, a substantial difference in quality between commercial and open-source ASR tools, and differences of up to a factor of ten for individual systems across corpora. Using CEASR, we could obtain these results very efficiently and easily. This shows that our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, without the need to collect, process and transcribe the speech data themselves.
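    The Word Error Rate used for this evaluation is the word-level edit distance between an ASR hypothesis and the reference transcript, normalised by the reference length. A self-contained sketch of the computation:

```python
# Sketch of the Word Error Rate (WER): Levenshtein distance over words,
# normalised by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```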

    ZHAW-InIT at GermEval 2020 task 4 : low-resource speech-to-text

    This paper presents the contribution of ZHAW-InIT to Task 4, "Low-Resource STT", at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN acoustic model, which we fine-tune on the task data. We enhance the base system with an extended language model containing in-domain data, apply speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.
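    Speed perturbation, mentioned above as an augmentation step, is commonly implemented by resampling the waveform while keeping the nominal sample rate, which changes tempo and pitch together. An illustrative sketch using librosa (the file name is hypothetical, and this is not necessarily the exact pipeline used for the Jasper system):

```python
# Illustrative sketch of speed perturbation as ASR data augmentation.
# Assumes librosa is installed; "utterance.wav" is a hypothetical file name.
import librosa

def speed_perturb(wav_path: str, factor: float, sr: int = 16000):
    """Return the waveform sped up (factor > 1) or slowed down (factor < 1).

    Resampling the signal and keeping the original nominal sample rate
    changes both tempo and pitch, the usual speed-perturbation recipe.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor)), sr

# Typical perturbation factors used for augmentation.
for f in (0.9, 1.0, 1.1):
    augmented, rate = speed_perturb("utterance.wav", f)
```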

    Design patterns for resource-constrained automated deep-learning methods

    We present an extensive evaluation of a wide variety of promising design patterns for automated deep-learning (AutoDL) methods, organized according to the problem categories of the 2019 AutoDL challenges, which set the task of optimizing both model accuracy and search efficiency under tight time and computing constraints. In the absence of strong theoretical support, we propose structured empirical evaluations as the most promising avenue for obtaining design principles for deep-learning systems. From these evaluations, we distill relevant patterns which give rise to neural network design recommendations. In particular, we establish (a) that very wide fully connected layers learn meaningful features faster; we illustrate (b) how the lack of pretraining in audio processing can be compensated for by architecture search; we show (c) that in text processing, deep-learning-based methods only pull ahead of traditional methods for short texts of less than a thousand characters under tight resource limitations; and lastly we present (d) evidence that in very data- and computing-constrained settings, hyperparameter tuning of more traditional machine-learning methods outperforms deep-learning systems.
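    Finding (d) refers to tuning the hyperparameters of traditional machine-learning models. A toy illustration of such tuning with scikit-learn's randomized search (not the AutoDL challenge protocol or the paper's evaluation setup):

```python
# Toy illustration of hyperparameter tuning for a traditional ML method,
# as referred to in finding (d); not the AutoDL evaluation protocol.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)          # small stand-in dataset
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 8, 16],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```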